Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

RNA-Seq Data Analysis ◾ 177

or healthy and diseased) but it can also be a complex study that includes more than a single factor (factorial design). Once the study design has been recorded as metadata, inferential statistics are used to identify which genes show a statistically significant change in expression compared to the same gene in a reference sample. The fundamental step in differential expression analysis is to model the association between gene counts (Y) and the covariates (conditions) of interest (X). The number of replicates is crucial for this statistical analysis, and in most RNA-Seq studies it is small. Instead of non-parametric statistical analysis, most programs for RNA-Seq data analysis therefore use generalized linear models (GLMs), which assume that the count data follow a certain statistical distribution. That approach also assumes that each RNA-Seq read is sampled independently from a population of reads and that the read is either aligned to gene g or not. When the read is aligned to gene g, we call this a success; otherwise, it is a failure. A sequence of independent random trials with two possible outcomes (success or failure) is called a Bernoulli process. Thus, according to probability theory, the number of reads (successes), $Y_{gj}$, for a given gene g from sample j follows a binomial distribution.

$$Y_{gj} \sim \mathrm{Binomial}(n_j, \pi_{gj}) \tag{5.8}$$

Here $Y_{gj}$ is the number of reads from sample j that align to gene g, $n_j$ is the number of independent trials in the Bernoulli process (the reads sequenced from sample j), $\pi_{gj}$ is the probability of success (a read is aligned to gene g in sample j), and $1 - \pi_{gj}$ is the probability of failure.
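The Bernoulli process described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the values of `n_j` and `pi_gj` are hypothetical, and the helper `simulate_gene_count` is introduced here for illustration and is not part of any RNA-Seq tool. Each read is treated as one trial that succeeds with probability `pi_gj`, and the successes are summed to give one draw of the gene count.

```python
import random

def simulate_gene_count(n_reads, pi_g, rng):
    """Run n_reads independent Bernoulli trials and count the successes,
    i.e. the reads that align to gene g."""
    return sum(1 for _ in range(n_reads) if rng.random() < pi_g)

rng = random.Random(0)   # fixed seed so the sketch is reproducible
n_j = 5000               # hypothetical number of reads in sample j
pi_gj = 0.002            # hypothetical probability that a read aligns to gene g

# Repeat the experiment to see how the count varies from run to run.
counts = [simulate_gene_count(n_j, pi_gj, rng) for _ in range(100)]
mean_count = sum(counts) / len(counts)

print("simulated mean count:", mean_count)
print("n_j * pi_gj         :", n_j * pi_gj)
```

Averaged over many repetitions, the simulated count should settle near `n_j * pi_gj`, which is the mean of the binomial distribution.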

Assume also that gene g in sample j has length $l_g$ and read count $Y_{gj}$. All possible positions in g that can produce a read can be described as $Y_{gj} l_g$ [30]. Thus, the probability of success, $\pi_{gj}$, is given as

$$\pi_{gj} = \frac{Y_{gj}\, l_g}{\sum_{g'=1}^{G} Y_{g'j}\, l_{g'}} \tag{5.9}$$

where G is the number of genes in the sample.
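As a rough illustration of the success probability defined above, the sketch below weights each gene's read count by its length and normalizes over all genes in one sample. The gene names, counts, and lengths are made-up values, and `success_probabilities` is a hypothetical helper written for this example.

```python
def success_probabilities(counts, lengths):
    """Return pi_g = Y_g * l_g / sum over g' of (Y_g' * l_g') for one sample."""
    weights = {g: counts[g] * lengths[g] for g in counts}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}

# Hypothetical read counts Y_gj and gene lengths l_g (in bases) for one sample j
counts = {"geneA": 100, "geneB": 50, "geneC": 10}
lengths = {"geneA": 1000, "geneB": 2000, "geneC": 500}

pi = success_probabilities(counts, lengths)
for gene, p in pi.items():
    print(gene, round(p, 4))
```

Because the denominator sums over all G genes, the probabilities for one sample add up to 1. Here geneA and geneB receive the same probability, since the products of count and length are equal.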

According to the binomial distribution, the mean of the read counts is given as

$$E(Y_{gj}) = n_j \pi_{gj} \tag{5.10}$$

The probability of observing exactly y reads for gene g in sample j is given by

$$P(Y_{gj} = y) = \binom{n_j}{y}\, \pi_{gj}^{\,y}\, (1 - \pi_{gj})^{\,n_j - y} \tag{5.11}$$
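The binomial probability and its mean can be checked numerically. The following is a minimal sketch using only the Python standard library; `n_j` and `pi_gj` are small hypothetical values chosen so that the full distribution can be enumerated, not realistic sequencing depths.

```python
from math import comb

def binomial_pmf(y, n, p):
    """P(Y = y) for Y ~ Binomial(n, p)."""
    return comb(n, y) * p**y * (1 - p) ** (n - y)

n_j = 50      # hypothetical number of trials (reads)
pi_gj = 0.1   # hypothetical probability of success

# Enumerate the whole distribution over y = 0, ..., n_j.
probs = [binomial_pmf(y, n_j, pi_gj) for y in range(n_j + 1)]
total = sum(probs)
mean = sum(y * p for y, p in enumerate(probs))

print("sum of probabilities:", total)
print("mean from the pmf   :", mean)
print("n_j * pi_gj         :", n_j * pi_gj)
```

The probabilities sum to 1 and the mean computed from them agrees with `n_j * pi_gj`, up to floating-point rounding.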

However, because RNA-Seq count data involve a very large number of reads and the probability that any given read aligns to a particular gene is very small, the Poisson distribution is a more appropriate model than the binomial distribution, provided that the mean read count of a gene equals its variance, as the Poisson distribution assumes.
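To see why the Poisson distribution works well when the number of reads is large and the success probability is small, the sketch below compares the two probability functions, setting the Poisson rate to the binomial mean. The values of `n_j` and `pi_gj` are hypothetical and chosen only to make the comparison concrete.

```python
from math import comb, exp, factorial

def binomial_pmf(y, n, p):
    """P(Y = y) for Y ~ Binomial(n, p)."""
    return comb(n, y) * p**y * (1 - p) ** (n - y)

def poisson_pmf(y, lam):
    """P(Y = y) for Y ~ Poisson(lam)."""
    return exp(-lam) * lam**y / factorial(y)

n_j = 100_000     # hypothetical number of reads in sample j
pi_gj = 0.0001    # hypothetical probability that a read aligns to gene g
lam = n_j * pi_gj # Poisson rate set equal to the binomial mean

# Largest pointwise gap between the two distributions over y = 0, ..., 30
max_diff = max(
    abs(binomial_pmf(y, n_j, pi_gj) - poisson_pmf(y, lam)) for y in range(31)
)

binom_var = n_j * pi_gj * (1 - pi_gj)  # binomial variance
print("max |binomial - Poisson|:", max_diff)
print("binomial mean:", lam, " binomial variance:", binom_var)
```

With these values the two distributions are nearly indistinguishable, and the binomial variance is almost equal to its mean, which matches the equal mean and variance that the Poisson distribution assumes.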